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Abstract 

Estimating large covariance and precision matrices are fundamental in modern mul¬ 
tivariate analysis. The problems arise from statistical analysis of large panel economics 
and finance data. The covariance matrix reveals marginal correlations between vari¬ 
ables, while the precision matrix encodes conditional correlations between pairs of 
variables given the remaining variables. In this paper, we provide a selective review of 
several recent developments on estimating large covariance and precision matrices. We 
focus on two general approaches: rank based method and factor model based method. 
Theories and applications of both approaches are presented. These methods are ex¬ 
pected to be widely applicable to analysis of economic and financial data. 

Keywords: High-dimensionality, graphical model, approximate factor model, principal 
components, sparse matrix, low-rank matrix, thresholding, heavy-tailed, elliptical distribu¬ 
tion, rank based methods. 

1 Introduction 

Estimating large covariance and precision (inverse covariance) matrices becomes funda¬ 
mental problems in modern multivariate analysis, which find applications in many fields, 
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ranging from economics and finance to biology, social networks, and health sciences (Fan 


et al. 2014a). When the dimension of the covariance matrix is large, the estimation problem 


is generally challenging. It is well-known that the sample covariance based on the observed 
data is singular when the dimension is larger than the sample size. In addition, the aggrega¬ 
tion of massive amount of estimation errors can make considerable adverse impacts on the 
estimation accuracy. Therefore, estimating large covariance and precision matrices attracts 
rapidly growing research attentions in the past decade. 

In recent years researchers have proposed various regularization techniques to consistently 
estimate large covariance and precision matrices. To estimate large covariance matrices, one 
of the key assumptions made in the literature is that the target matrix of interest is sparse, 


namely, many entries are zero or nearly so (Bickel and Levina, 2008 Lam and Fan, 2009 


El Karoui 2010 Rigollet and Tsybakov[ 2012). To estimate large precision matrices, it is 


often the case that the precision matrix is sparse. A commonly used method for estimating 
the sparse precision matrix is to employ an i \-penalized maximum likelihood, see for instance, 


Banerjee et al. (2008); Yuan and Lin (2007); Friedman et al. (2008); Rothman et al. (2008). 


To further reduce the estimation bias, Lam and Fan (2009); Shen et al. (2012) proposed 
non-convex penalties for sparse precision matrix estimation and studied their theoretical 


properties. For more general theory on penalized likelihood methods, see Fan and Li (2001); 


Fan and Peng (2004); Zou (2006); Zhao and Yu (2006); Bickel et al. (2009); Wainwright 


(2009). 


The literature has been further expanded into robust estimation based on regularized 


rank-based approaches (Liu et al. 2012a Xue and Zou, 2012). The rank-based method is 
particularly appealing when the distribution of the data generating process is non-Gaussian 
and heavy-tailed. It is particularly appealing for analysis of financial data. The literature 


includes, for instance, Han and Liu (2013); Wegkamp and Zhao (2013); Mitra and Zhang 


(2014), etc. The heavy-tailed data are often modeled by the elliptical distribution family, 
which has been widely used for financial data analysis. See Owen and Rabinovitch (1983); 


Hamada and Valdez 

(2004 

) and 

Frahm and Jaekcl 

(2008) 


In addition, in many applications the sparsity property is not directly applicable. For 
example, financial returns depend on the equity market risks, housing prices depend on the 
economic health, gene expressions can be stimulated by cytokines, among others. Due to the 
presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. 
A natural extension is the conditional sparsity , namely, conditional on the common factors, 
the covariance matrix of the remaining components of the outcome variables is sparse. In 
order to do so, we consider a factor model. The factor model is one of the most useful 
tools for understanding the common dependence among multivariate outputs, which has 
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broad applications in the statistics and econometrics literature. For instance, it is commonly 
used to measure the vector of economic outputs or excessive returns of financial assets over 
time, and has been found to produce good out-of-sample forecast for macroeconomic vari¬ 
ables (Boivin and Ng, 2005 Stock and Watson, 2002). In high dimensions, the unknown 
factors and loadings are typically estimated by the principal components method, and the 
separations between the common factors and idiosyncratic components are characterized via 


pervasiveness assumptions. See, for instance, Stock and Watson (2002); Bai (2003); Bai 


and Ng 

(2002); Fan et a] 

. (2008); Breitung and Tenhofen ( 

2011); 

Gnats! 

d (2012); 

Lam and 

Yao 

2012); Fan et al. (i 

2013), among others. In the stati 

stical literature, the separations 
are carried out by the low-rank 

between the common fac 

plus sparsity decomposit 

tors and idiosyncratic components 

ion. See, for example, Candes and Recht 

(2009): 

Koltchinskii et al. 

(2011 

); 

Fan et al. (2011), 

Negahban and Wainwright (2011 

; Cai et al. (2 

013); Ma 

(2013). 


In this paper, we provide a selective review of several recent developments on estimating 
large covariance and precision matrices. We focus on two general approaches: rank-based 
method and factor model based method. Theories and applications of both approaches are 
presented. Note that this paper is not an exhaustive survey, and many other regularization 


methods are also commonly used in the literature, e.g., the shrinkage method (Ledoit and 


Wolf, 2003, 2004). We refer to Fan and Liu (2013), Pourahmadi (2013) and the references 


therein for reviews of other commonly used methods. 

This paper is organized as follows. Section 2 presents methods of estimating sparse 
covariance matrices. Section 3 reviews methods of estimating sparse precision matrices. 
Section 4 discusses robust covariance and precision matrix estimations using rank-based 
estimators. Sections 5 and 6 respectively presents factor models based method, respectively 
in the cases of observable and unobservable factors. Section 7 introduces the structured 
factor model. Finally, Section 8 provides further discussions. 

Let A min (A) and A max (A) respectively denote the minimum and maximum eigenvalues 
of A. Let VVax(A) be the largest singular value of A. We shall use || A || 2 and ||A|| F 
to denote the operator norm and Frobenius norm of a matrix A, respectively defined as 
Amax(A / A) and tr 1,/2 (A / A). Throughout this paper, we shall use p and T to respec¬ 
tively denote the dimension of the covariance matrix of interest, and the sample size. 
Let v = (vi,... ,v p )' e W be a real valued vector, we define the vector norms: llvl^ = 




• 3=1 1 ^ 1 ’ 


|v|| 2 = J2 P j=i v h IMloo = maxi<j< p \vj\. Let S be a subspace of 


we use 


vs to denote the projection of v onto S\ vs = argmin ue5 ||u — v|||. We also define the 
orthogonal complement of S as S 1 = {u e u'v = 0, for any v G 5}. Let A e W xp 
and I, J C {1,...,A} be two sets. Denote by A i t j the submatrix of A with rows and 
columns indexed by / and J. Letting A = (Ay,..., A P j)' and A&* = (A*,i,..., Ak p )' 
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denote the j th column and k th row of A in vector forms, we define the matrix norms: 
||A||i = maxj ||A*j||i, HAHoo = max*, ||A*,*||i, ||A|| max = max.,- HA^Hoq. We also define 
matrix elementwise (pseudo-) norms: ||A||i )0 fr = IA?fcl an d ||A||oo,off — max^j, | Ajk\. 

We write a n x b n if there are positive constants C\ and c 2 independent of n such that 
Ci b n ct n C 2 b n . 


2 Estimating sparse covariance matrix 

Let Y it be the observed data for the i th (i = individual at time t = 1 (or 

the t th observation for the i th variable). We are interested in estimating the pxp covariance 
matrix £ = ( crij) p x p of Y t — (Y u , ...,Y pt )', assumed to be independent of t. The sample 
covariance matrix is defined as 

T T 

S = A- Y)(Y,-Y)’, 

t=l i=l 


When p > T, however, it is well-known that S is singular. It also accumulates many 
estimation errors due to the large number of free parameters to estimate. 

Sparsity is one of the most essential assumptions for high-dimensional covariance matrix 
estimation, which assumes that a majority of the off-diagonal elements are nearly zero, and 
effectively reduces the number of free parameters to estimate. Specifically, it assumes that 
there is q > 0, so that the following defined quantity 


m p = 


maxj< p YJj=i H a ij ± °}, 
maxj< p YJj=i \ a v\ q i 


if q = 0 
if 0 < q < 1 


( 1 ) 


is either bounded or grow slowly as p —>• oo. Here 1{-} denotes the indicator function. Such 
an assumption is reasonable in many applications. For instance, in a longitudinal study 
where variables have a natural order, variables are likely weakly correlated when they are 
far apart (jWu and Pourahmadh 2003). Under the sparsity assumption, many regularization 
based estimation methods have been proposed. This section selectively overviews several 
state-of-the-art statistical methods for estimating large sparse covariance matrices. 


2.1 Thresholding estimation 


One of the most convenient methods to estimate sparse covariance matrices is the thresh¬ 


olding, which sets small estimated elements to zero (Bickel and Levina, 2008). Let Sij be the 
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(■ i,j) th element of S. For a pre-specified thresholding value oj t , define 


S — (( Tij ) 


py.pi 


a ij — 




if i = j 


s ij l{\s ij \> u T }, if i^j 


( 2 ) 


The thresholding value should dominate the maximum estimation error max,^ |sy — a t] \. 
When the data are Gaussian or sub-Gaussian, it can be taken as 


cu T = cy :for some C > 0 

so that the probability of the exception event {max^-1 s tJ — a tJ > cap} tends to zero very 
fast. 

The advantage of thresholding is that it avoids estimating small elements so that noise 
does not accumulate. The decision of whether an element should be estimated is much easier 


than the attempt to estimate it accurately. Indeed, under some regularity conditions, Bickel 
(2008) showed that, if m p uxf q —>• 0 as p, T -)• oo, we have 


and Levina 


|£ — 521| 2 = Op(mpU>T q ) and ||52 1 — S 1 || 2 = Op(m p uj q ), 


(3) 


where m p and q are as defined in ([Tj) . In the case that all the “small” elements of S are 
exactly zero so that we take q — 0, the above convergence rate becomes Op{\J~^^-) if m p 
is bounded. Since each element in the covariance matrix can be estimated with an error of 
order Op(T~ 1//2 ), it hence only costs us a log(p) factor to learn the unknown locations of the 
non-zero elements. 


2.2 Adaptive thresholding and entry-dependent thresholding 

The simple thresholding (J2| does not take the varying scales of the marginal standard 
deviations into account. One way to account this is to threshold on the t-type statistics. For 
example, using the simple thresholding, we can define the adaptive thresholding estimator 


(Cai and Liu, 2011): 


S = (a.. 


ij Jpxpi 


a ij 




if i = j 


5 ijl{l s *jl/ SE («ij) > up}, if i^j 


(4) 


where SE(s^) is the estimated standard error of s t ,j. 

A simpler method to take the scale into account is to directly apply thresholding on the 


5 












correlation matrix. Let R = diag(S)~ 1 // 2 Sdiag(S )~ 1//2 = ( U^pxp be the sample correlation 
matrix. We then apply the simple thresholding on the off-diagonal elements of R, and obtain 
the thresholded correlation matrix R r . So the ( i,j) th element of R r is 1 {| r tJ > u>t} when 
i 7 ^ j, and one if i = j. Then the estimated covariance matrix is defined as 

S* = diag(S) 1 / 2 R r diag(S) 1/2 . 

In particular, when = 0, it is exactly the sample covariance matrix since no thresholding 
is employed, whereas when ujt — 1 , it is a diagonal matrix with marginal sample variances 
on its diagonal. This form is more appropriate than the simple thresholding since it is 
thresholded on the standardized scale. Moreover, E* is equivalent to applying the entry 
dependent thresholding 

0JT,ij y/ SiiSjjCJT 

to the original sample covariance S. 


2.3 Generalized thresholding 

The introduced thresholding estimators (J2]) and Q are based on a simple thresholding 
rule, known as the hard-thresholding. In regression and wavelet shrinkage contexts (see, for 


example, Donoho et al. (1995)), hard thresholding performs worse than some more flexible 
regularization methods, such as the soft-thresholding and the smoothly clipped absolute 
deviation (SCAD) (Fan and Li, 2001), which combine thresholding with shrinkages. The 


estimates resulting from such shrinkage typically are continuous functions of the maximum 
likelihood estimates (under Gaussianity), a desirable property that is not shared by the hard 
thresholding method. 

Therefore, the generalized thresholding rules of Antoniadis and Fan (20011 can be applied 
to estimating large covariance matrices. The generalized thresholding rule depends on a 
thresholding parameter ujt and a shrinkage function : M. —> M, which satisfies 

(i) \Ii(z,ujt)\ < M; ( ii ) h(z-,u>T ) = 0 for \z\ < ujt', (Hi) \h(z]u>T) — z\ < ujt- 

There are a number of useful thresholding functions that are commonly used in the litera¬ 
ture. For instance, the soft-thresholding takes h(z]UT ) =sgn( 2 )(|^| — cot)+, where (x) + = 
maxjz, 0}. Moreover, the SCAD thresholding is a compromise between hard and soft thresh¬ 
olding, whose amount of shrinkage decreases as \z\ increases and hence results in a nearly 


unbiased estimation. Another example is the MCP thresholding, proposed by Zhang (2010). 
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We can then define a generalized thresholding covariance estimator: 


S = (a i:j ) 


pxpt 






if l = J 


h(sij\u} T ), if i^j 


( 5 ) 


Note that this admits the hard-thresholding estimator ([2]) as a special case by taking h(z] cut) = 
*1(1 z\ > c ut}- Both the adaptive thresholding and entry dependent thresholding can also 
be incorporated, by respectively setting h(sij, and /i(sy, y/s ll s ]J ujT) on the (i,j) th 


element of the estimated covariance matrix when i ^ j. In addition, it is shown by Roth¬ 


man et al. (2009) that the use of generalized thresholding rules does not affect the rate of 


convergence in (J3]), but it increases the family of shrinkages. 


2.4 Positive definiteness 

If the covariance matrix is sparse, it then follows from ([3]) that the thresholding estimator 
S is asymptotically positive definite. On the other hand, it is often more desirable to require 
the positive definiteness under finite samples. We discuss two approaches to achieving the 
finite sample positive definiteness. 


2.4.1 Choosing the thresholding constant 

For simplicity, we focus on the constant thresholding value cur,ij = cut ; the case of entry- 
dependent thresholding can be dealt similarly. The finite sample positive definiteness de¬ 
pends on the choice of the thresholding value Ut, which also depends on a prescribed constant 
C through ut = We write £(C) = S to emphasize its dependence on C. When C 

is sufficiently large, the estimator becomes diagonal, and its minimum eigenvalue is strictly 
positive. We can then decreases the choice of C until it reaches 


C min = inf (C > 0 : A min (S(M)) > 0, VM > C}. 


Thus, C min is well defined and for all C > Cm in . S(C') is positive definite under finite 
sample. We can obtain C m i n by solving A m i n (S(C')) = 0, C ^ 0. Figure [l] plots the minimum 
eigenvalue of E((7) as a function of C for a random simple from a Gaussian distribution 
with p > T, using three different thresholding rules. It is clearly seen from the figure that 
there is a range of C in which the covariance estimator is both positive definite and non¬ 
diagonal. In practice, we can choose C in the range (C m \ n + e, M) for a small e and large 


enough M by, e.g., cross-validations. This method was suggested by Fan et al. (2013) in a 


more complicated setting. Moreover, we also see from Figure [l] that the hard-thresholding 
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rule yields the narrowest range for the choice C to give both positive definiteness and the 
non-diagonality. 


Figure 1: Minimum eigenvalue of E(C) as a function of C for three choices of thresholding 
rules. When the minimum eigenvalue reaches its maximum value, the covariance estimator 
becomes diagonal. 



2.4.2 Nearest positive definite matrices 


An alternative approach to achieving the finite sample positive definiteness is through 


solving a constraint optimization problem. Qi and Sun (2006) introduced an algorithm 


for computing the nearest correlation matrix-, recall that R r is the thresholded correlation 
matrix, defined in Section 2.2, we find its nearest positive definite correlation matrix R by 
solving: 


R = argmin 11R 

A 


r 


ah!, 


s.t. A > 0, diag(A) = I p . 


We can then transform back to the covariance matrix as: S = diag(S) 1 / 2 Rdiag(S) 1 / 2 . Note 
that if R r itself is positive semi-definite, R = R r ; otherwise R is the nearest positive 
semi-definite correlation matrix. This procedure is often called “nearest correlation matrix 
projection”, and can be solved effectively using the R-package “nearPD”. 

The nearest correlation matrix projection, however, does not necessarily result in a sparse 
solution when R r is not positive definite. Liu et al. (2014a) introduced a covariance esti¬ 


mation method named EC2 (Estimation of Covariance with Eigenvalue Constraints). To 
motivate this method, note that the thresholding method ([5]) can be equivalently casted as 















the solution to a penalized least squares problem: 


E 


argmin 

s=0 ij) 


|||S- S 



where P UT {) is a penalty function, which corresponds to the shrinkage function For 

instance, when 

P UT {t) = uj{- (|t| - cu T ) 2 l{|t| < cu T }, 


the solution is the hard-thresholding estimator ([ 2 ]) (Antoniadis (1997)). See Antoniadis and 


Fan (2001) for the corresponding penalty functions of several popular shrinkage functions. 


The sparsity of the resulting estimator is hence due to the penalizations. We can modify 
the above penalized least squares problem by adding an extra constraint to obtain positive 
definiteness: 


E = argmin —1| S — E 

Amin(S)>T 


E* 

*7U 


ur\ a ij/ 


( 6 ) 


where r > 0 is a pre-specihed tuning parameter that controls the smallest eigenvalue of 
the estimated covariance matrix E. As a result, both sparsity and positive definiteness are 
guaranteed. Liu et al. (2014a) showed that the problem ([ 6 ]) is convex when the penalty 
function is convex, and develops an efficient algorithm to solve it. More details on the 
algorithm and theory of this estimator will be explained in Section 4. 


3 Estimating sparse precision matrix 

Estimating a large inverse covariance matrix © = E^ 1 is another fundamental problem 
in modern multivariate analysis. Unlike the covariance matrix E which only captures the 
marginal correlations among Y t = (Yu,..., Y pt )', the inverse covariance matrix © captures 
the conditional correlations among these variables and is closely related to undirected graphs 
under a Gaussian model. 

More specifically, we define an undirected graph G = (V,E), where V contains nodes 
corresponding to the p variables in Y t and the edge (j, k) e E if and only if &jk ^ 0. 
Under a Gaussian model Y t ~ N(0, E), the graph G describes the conditional independence 
relationships among Y t = ( Y u ,..., Y pt )'. More specifically, let 17,\{j,fc} = {Yu '■ d f- j, k}. Yj t 
is independent of Y^t given Y t) \{j t k} for all (j, k ) ^ E. 

To illustrate the difference between the marginal and conditional uncorrelatedness. We 
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consider a Gaussian model Y t ~ N( 0, S) with 
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We see that the inverse covariance matrix © has many zero entries. Thus the undirected 
graph G defined by © is sparse. However, the covariance matrix S dense, which implies that 
every pair of variables are marginally correlated. Thus the covariance matrix and inverse 
covariance matrix encode different relationships. For example, even though Y\ t and Y 5t 
are conditionally uncorrelated given the other variables, they are marginally correlated. In 
addition to the graphical model problem, sparse precision matrix estimation has many other 


applications. Examples include high dimensional discriminant analysis (Cai et al. 2011), 


portfolio allocation ( 

Fan et al. 

2008, 

2012 

), principal component analysis, and complex data 

visualization ( 

Tokuda et al. 

2011 

)• 



Estimating the precision matrix © requires very different techniques from estimating 
the covariance matrix. In the following subsections, we introduce several large precision 
estimation methods under the assumption that © is sparse. 


3.1 Penalized likelihood method 

One of the most commonly used approaches to estimating sparse precision matrices 
is through the maximum likelihood. When Y(, • • • ,Yp are independently and identically 
distributed as 1V(0, £), the negative Gaussian log-likelihood function is given by £(©) = 
tr (S©) — log |©|. When either the data are non-Gaussian or the data are weakly dependent, 
£(©) becomes the quasi negative log-likelihood. Nevertheless, we then consider the following 
penalized likelihood method: 


© = argmin 

® = )pXp 


|tr(S©)-log|©| + ^P WT (|%|)} 


where the penalty function P UT (1 9,j \), defined the same way as in Section 2.4.2, encourages 
the sparsity of ©. One of the commonly used convex penalty is the t\ penalty P WT (t) = 


and the problem is then well studied in the literature (e.g., Yuan and Lin (2007); Friedman 


et al. (2008); Banerjee et al. (2008)). Other related works are found in, e.g., Meinshausen 


and Biihlmann (2006a); Wille et ah] (2004) 
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In general, we recommend to use folded concave penalties such as SCAD and MCP, as 
these penalties do not introduce extra bias for estimating nonzero entries with large absolute 


values (Lam and Fan, 2009). Using local linear approximations, the penalized likelihood can 


be computed by an iterated reweighed Lasso: Given the estimate ©^ = (O^) at the k th 
iteration, by the Taylor’s expansion, we approximate 


fW(l%l)» fW(l«f I) + KSWWii\ - l«fl) = Q.A 1%). 


The linear approximation Q Ut is the convex majorant of the folded concave function at \d^\, 
namely, it satishes 


Pu> T (\dij\) < QurQOij]), 



Then the next iteration is approximated by 


@0+i) _ ar g m j n 


tr(S©) — log |©| 


E 




1 ) 10 , 


+ c ) 


(7) 


where c is a constant that does not depend on ©. The problem (Jt]) is convex and can 
be solved by the graphical Lasso algorithm of Friedman et al. (2008). Such an algorithm is 


called majorization-minimization algorithm (Lange et ah, 2000). Since the penalty function is 


majorized from above, it can easily be shown that the original objective function is decreasing 
in the iterations. Indeed, let /(©) = tr(S©) — log |@| + . P UT (\Qij\) be the target value 

and g(&) be its majorization function with P UT (1 6 t] \) replaced by Qo; T (|%|)- Then, 

/(©U+D) < c/(©( fc+ b) < g(@W) = /(© (fc) ), 


where the first inequality follows from the majorization, the second inequality comes from 
the minimization, and the last equality follows the majorization at the point 0^. 
Theoretical properties of © have been thoroughly studied by 


Rothman et al. 


(2008) and 


Lam and Fan (2009). 


3.2 Column-by-column estimation method 

Under the Gaussian model Y t ~ N( 0, E), another approach to estimating the precision 


matrix © is through column-by-column regressions. For this, Yuan (2010) and Cai et al. 


(2011) propose the graphical Dantzig selector and CLIME respectively, which can be solved 


by linear programming. More recently, Liu and Luo (2012) and Sun and Zhang (2012) 
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propose the SCIO and scaled-Lasso methods. Compared to the penalized likelihood methods, 
the column-by-column estimation methods are computationally simpler and more amenable 
to theoretical analysis. 

The column-by-column precision matrix estimation method exploits the relationship be¬ 
tween conditional distribution of multivariate Gaussian and linear regression. More specifi¬ 
cally, let Y ~ iV(0, X), the conditional distribution of Yj given Ysj satishes 

where otj = ('E\j t \j)~ 1 'E\jj G M p_1 and aj = Tijj — 'E\jj('E\j t \j)~ 1 'E\jj. Hence, we can write 


Yj cXjY\j + 


( 8 ) 


where €j ~ iV(0, aj') is independent of Y\j. Using the block matrix inversion formula, we 
have 

(9) 


& 33 = CT j 2 ’ 


®\Jj a j a j 


Therefore, we can recover © in a column-by-column manner by regressing Yj on Y 3 for 
j = 1,2 ,■■■ ,p. For example, let Y G M Txp be the data matrix. We denote by otj : = 
(ayi,..., OLj( p - i )) 7 G M p_1 . Meinshausen and Biihlmann (2006b) propose to estimate each exj 


by solving the Lasso regression: 


^ lii 112 11 11 

aj = argmin — - Y^-c^ + Xj 

Q, 6RP- 1 11 


where Xj is a tuning parameter. Once otj is obtained, we get the neighborhood edges by 
reading out the nonzero coefficients of otj. The final graph estimate G is obtained by either 
the “AND” or “OR” rule on combining the neighborhoods for all the N nodes. To estimate 


©, we also estimate the aj 's using the fitted sum of squared residuals aj = T 1 11 Y : 
Y^jOtj ll^, then plug it into ([9]). 


In another work, Yuan (2010) proposes to estimate otj by solving the Dantzig selector: 


otj = argmin 11 ck j 11 subject to 


aj eiRp - 1 




where S := T -1 Y'Y is the sample covariance matrix and 7 j is a tuning parameter. The 
constraint corresponds to a sample version of — J}\j j \jCtj = 0 , with 7 j indicating the 
estimation error. Once otj is given, we can estimate aj by aj = [l — 2q / ? S\jj + 

We then obtain an estimator © of © by plugging otj and aj into (|9|). Yuan (2010) analyzes 
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the Za-norm error ||© — ©|| x and shows its minimax optimality over certain model space. 


More recently, Sun and Zhang (2012) propose to estimate ctj and Oj by solving a scaled- 
Lasso problem: 


f b'Sb a I , 1 

< ^ + A 2^ Skk \ bk \ sub Ject to b j = -1 >• 

^ ° k =1 ' 


b j,aj = argmin 

b=(6i,...,6p)',cr 


Once bj is obtained, we estimate ctj = ( 6 i,..., bj-i, bj + i ,..., b p )' . We then obtain the 
estimator of © by plugging ctj and a 3 into(j9]) . Sun and Zhang (2012) provide the spectral- 
norm rate of convergence of the obtained precision matrix estimator. 


Similar to the idea of the graphical Dantzig selector, Cai et al. (2011) propose the CLIME 


estimator, which stands for “Constrained l \-Minimization for Inverse Matrix Estimation 1 ' 
This method directly estimates the j th column of © by solving 


©*j = argmin11©*j 11 subject to ||S© 




*3 


<S 3 , for j = l,...,p, 


where e 3 is the j th canonical vector (i.e., the vector with the j th element being 1 , while the 
remaining elements being 0) and 8j is a tuning parameter. Again, the constraint represent a 
sample version of E©*j — e 3 = 0. This optimization problem can be formulated into a linear 


program and has the potential to scale to large problems. Under regularity conditions, Cai 


et ah 


( 2011 ) show that the estimator © is asymptotically positive definite, and derive its 


rate of convergence. 

In a closely related work of CLIME, Liu and Luo (2012) propose the SCIO estimator, 
which estimates the j th column of © by 

©*j = argmin| ^©CS©*,- - + Aj ||@* j -|| 1 

The SCIO estimator can be solved efficiently by the pathwise coordinate descent algorithm 


(Friedman et al. 2007). 


3.3 Tuning-insensitive precision matrix estimation 

Most of the methods described in the former subsection require choosing some tuning pa¬ 
rameters that control the bias-variance tradeoff. Their theoretical justifications are usually 
built on some theoretical choices of tuning parameters that cannot be implemented in prac¬ 
tice. For example, in the neighborhood pursuit method and the graphical Dantzig selector, 
the theoretically optimal tuning parameters X 3 and 7 j depend on cr|, which is unknown. The 
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optimal tuning parameters of the CLIME and SCIO depend on ||©||i, which is unknown. 


3.3.1 The TIGER method 


To handle the challenge of tuning parameter selection, Liu and Wang (2012) propose the 


TIGER (Tuning-Insensitive Graph Estimation and Regression) method, which is asymptot¬ 
ically tuning-free and only requires very few efforts to choose the regularization parameter 
in finite sample settings. 

The idea of TIGER is to estimate the precision matrix © in a column-by-column fashion. 
This idea has been adopted by many methods described in Section |3.2 These methods 
differ from each other mainly in how they solve the sparse regression subproblem. The only 
difference between TIGER and these methods is that TIGER solves its column-wise sparse 


regression problem using the SQRT-Lasso (Belloni et ah, 2012) 


The SQRT-Lasso is a penalized optimization algorithm for solving high dimensional linear 
regression problems. For a linear regression problem Y = X/3 + e, where Y G R T is the 
response vector, X G M Txp is the design matrix, (3 G M p is the vector of unknown coefficients, 
and e G M T is the noise vector. The SQRT-Lasso estimates {3 by solving 


^ = arg mm|-^= 11 Y - X/3|| 2 + AH^Hi}, 


where A is a tuning parameter. It is shown in Belloni et ah (2012) that the choice of A for 


the SQRT-Lasso method is asymptotically universal in the sense that it does not depend on 
any unknown parameters such as the noise variance. In contrast, most of other methods, 
including the Lasso and Dantzig selector, rely heavily on variance of the noise. Moreover, 
the SQRT-Lasso method achieves near oracle performance for the estimation of (3. 


In Liu and Wang (2012), they show that the objective function of the scaled-Lasso is 


a variational upper bound of the SQRT-Lasso. Thus the TIGER method is numerically 


equivalent to the method in Sun and Zhang (2012). However, the SQRT-Lasso is more 


amenable to theoretical analysis and allows us to simultaneously establish optimal rates of 
convergence for the precision matrix estimation under many different norms. 

In our setting, recall that S is the sample covariance matrix of Y t = (lit, • • •, Ypt)'- Let 
T = diag(S) be a p-dimensional diagonal matrix with the diagonal elements be the same as 
those in S. Consider the marginally standardized variables 


z:= (z 1 ,...,z p y = Yr- 1 ^ 2 . 
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By ([8]), we have 


We define 


Zj Tf = a ' jZsj T % 


+ G 


( 10 ) 


ft =- bALLA and t 2 = ^f" 1 . 


3 33 


Therefore, we have 


ft = ftfti + r jW 


JJ •?' 


( 11 ) 


We define R to be the sample correlation matrix: R := (diag(S)) 1 ^“S(diag(S)) ^ 2 . 
Motivated by the model in ( pTlj ) , we propose the following precision matrix estimator. 

TIGER Estimator 

For j = 1,... ,p, we estimate the j th column of © by solving : 


Pi '= argmin< yjl - 2/3'. R w + /3'R jt ,ffj + vr 

/3^-eRp- 1 L 


logp | 


\Pi\\i [ ’ 


T j :— y 1 2(3jR\jj + 


© = y 2 r-* and © w = -TT 2 f-y 2 f-y 2 /3 : 


s—2-p—1 

’ 3 ■ L 33 


3 w 1 

r 

3 ~ 33 \j,\j ^ ~ 


( 12 ) 

(13) 


Note that the first term in (12) is just the square-root of the the sum of the square loss for 

is 


the standardized variable under model ( JITj ); see (14). We see that the TIGER procedure is 
tuning free. If a symmetric precision matrix estimate is preferred, we conduct the following 
correction: © jfe = min j© jfc , for all k =/=■ j. Another symmetrization method is 


© = 


© + ©' 


Cai et al. 


(2011) show that, if © is a good estimator, then © will also be a good estimator: 


they achieve the same rates of convergence in the asymptotic settings. 

* 3 *jj for j — 1,... ,p. An 


Let Z G M. Txp be the normalized data matrix, i.e., Z= Y* ; T 1/ 
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equivalent form of (12) and (13) is 


Pj = argminl A=\\Z - Z*^|| 2 + 

fyeRp- 1 L V 1 

^ *\jP. 


U = 


v 't'' z "’ 


J II 2' 


(14) 

(15) 


Once © is estimated, we can also estimate the graph G := (V, E) based on the sparsity 
pattern of 0 jfc ^ 0. 


Liu and Wang (2012) establish the rates of convergence of the TIGER estimator © to 


the true precision matrix © under different norms. Under the assumption that the condition 
number of © is bounded by a constant, we have 


0-0 


= O t 


l©lli 


log p \ 
T ) 


(16) 


Under mild conditions, the obtained rate in (16) is minimax optimal over the model class 


consisting of precision matrices with bounded condition numbers. 


The result in (16) implies that the Frobenious norm error and spectral norm error between 
© and © satisfy the following: let s := 1 {® jfc 7^ 0} be the number of nonzero off- 

diagonal elements of ©; let k : = maxj =1 .. p V . 1 {0^ ^ 0}, 


0 - © L = O f 


101 


(p + s ) logpA 
T ) 


I © — ® 11 o = Op fc||0| 


logp 


T 


(17) 

(18) 


The obtained rates in (18) and ([l7|) arc minimax optimal over the same model class as before. 


3.3.2 The EPIC method 

Another tuning-insensitive precision matrix estimation method is EPIC (Estimating 


Precision matrix with Calibration), proposed by Zhao and Liu (2014). While TIGER can be 


viewed as a tuning-insensitive extension of the nodewise Lasso method proposed by |M 


em- 


shausen and Biihlmann (2006b), EPIC can be viewed as a tuning-insensitive extension of 


the CLIME estimator proposed by Cai et al. (2011). Unlike the TIGER method which relies 


on the normality assumption, the EPIC method can be used to handle both sub-Gaussian 
and heavy-tailed data. We postpone the details of the EPIC method to Section 4 where we 
discuss robust estimators of covariance and precision matrices for heavy-tailed data. 
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4 Robust precision and covariance estimators 


The methods introduced in Section 2 and Section 3 exploit the sample covariance matrix 
as input statistics. The theoretical justification of these methods relies on the sub-Gaussian 
assumption of the data, ffowever, many types of financial data are believed to follow the 
elliptical distributions, which are often heavy-tailed. This section introduces a regularized 
rank-based framework for estimating large precision and covariance matrices under elliptical 
distributions. First, we introduce a rank-based precision matrix estimator which naturally 
handles heavy-tailness and conducts parameter estimation under the elliptical models. Sec¬ 
ondly, we introduce an adaptive rank-based covariance matrix estimator which extends the 
generalized thresholding operator by adding an explicit eigenvalue constraint. We also pro¬ 
vide interpretations of these rank-based estimators under the more general elliptical copula 
model, which illustrates a tradeoff between model flexibility and interpretability. 


Throughout this section, we assume the data follow an elliptical distribution (Fang et al. 


1990), defined as below. 


Definition 1 (Elliptical Distribution). Given pi e and a symmetric positive semidefinite 
matrix X e R pxp with rank(X) = r < p, a p-dimensional random vector Y = (Ifi,..., Y p )' 
follows an elliptical distribution with parameters pi, f, and X, denoted by Y ~ EC(n,£, X), 
if Y has a stochastic representation 


Y = n + £Au, 


(19) 


where £ > 0 is a continuous random variable independent of u. Here u£§ r 1 is uniformly 
distributed on the unit sphere in and X = A A 7 . 

For notation convenience, we use f instead the distribution of f in the notation EC {pi, £, X) 


Note that the model in (19) is not identifiable since we can rescale A and f without chang¬ 


ing the distribution. In this section, we require E(£ 2 ) < oo and rank(X) = p to ensure the 
existence of the inverse of X. In addition, we impose an identifiability condition E(£ 2 ) = p 
to ensure that X is the covariance matrix of Y. We still denote © := X -1 . 


4.1 Robust precision matrix estimation 

To estimate ©, our key observation is that the covariance matrix X can be decomposed 
as X = DRD, where R is the Pearson’s correlation matrix, and D = diag(cr!,... ,a p ) where 
<jj is the standard deviation of Y). Since D is diagonal, we can represent the precision matrix 
as © = D -1 AD -1 , where A = R 1 is the inverse correlation matrix. Based on this relation¬ 


ship, the EPIC method of Zhao and Liu (2014) has three steps: first obtain estimators R and 
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D for R and D; then apply a calibrated inverse correlation matrix estimation procedure on 
R to obtain A, an estimator for A. Finally, assemble A and D to obtain a sparse precision 
matrix estimator 0 . 

For light-tailed distributions (e.g., Gaussian or sub-Gaussian), we can directly use the 
sample correlation matrix and sample standard deviation to estimate the matrices R and 
D. However, for heavy-tailed elliptical data, the sample correlation matrix and standard 
deviation estimators are inappropriate. Instead, we exploit a combination of the transformed 
Kendall’s tau estimator and Catoni’s M-estimator, which will be explained in details in the 
following subsections. 


4.1.1 Robust estimation of correlation matrix 


To estimate R, we adopt a transformed Kendall’s tau estimator proposed in Fang etah 


(1990). Define the population Kendall’s tau correlation between Y ]t and Y kt as 


r kj = P (Y jt - Y jt )(Y kt -Y k )> 0 - P (Y jt - Y,)(Y kt -Y k ) < 0 , 


where Yj and Y k are independent copies of Yj t and Y kt respectively. For elliptical distributions, 
it is a well known result that R k j and r k j satisfy 


R — [Rfcj] 


sin 


7T 

2 Tki 


( 20 ) 


The sample version Kendall’s tau statistic between Yj and Y k is 

?kj = T(T- 1) ^ ' Slgn { (Ykt ~ ~ Y ») 


for all k 7 ^ j, and T k j = 1 otherwise, 
correlation matrix estimator 


We can plug into (20) and obtain a rank-based 


R — [Rfcj] 



( 21 ) 


4.1.2 Robust estimation of standard deviations 


To estimate D, we exploit an M-estimator proposed by Catoni (2012). 
- 0 (f) = sign(f) ■ log(l + |t| +t 2 / 2 ) be a univariate function where sign( 0 ) = 0 . 


Specifically, let 
Let and rhj 


1 More details can be found in 


Fang et al. (1990). 
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be the estimators of E Yj t and E Y? t by solving the following two estimating equations: 


<=i 


^ (fa* _ ft) 


TK 

-L -LYyy 


= o, 



( 22 ) 

(23) 


where K max is an upper bound of maxj Var(Y) t ) and rnaxj Var(Y^). We assume /i max is 


known. Catoni (2012) shows that the solutions to (22) and (23) must exist and can be 


efficiently solved using the Newton-Raphson algorithm (Stoer et ah, 1993). Once fhj and Jl 3 
are obtained, we estimate the marginal standard deviation <jj by 


to = 


y n , - jr A‘ 1II1|; }. 


(24) 


where K m[n is a lower bound of miiij cr| and is assumed to be known. 

Compared to the sample covariance matrix, a remarkable property of R and a 3 is that 
they concentrate to their population quantities exponentially fast even for heavy-tailed data. 


More specifically, Liu et al. (2012b) show that 


IR-R 


= Oi 


log P \ 
T ) 


and max hr,- — crA = Op 
i <j<p' 1 


log P\ 
T )' 


(25) 


In contrast, the sample correlation matrix and sample standard deviation do not have the 
above properties for heavy-tailed data. 


4.1.3 The EPIC method for inverse correlation matrix estimation 

Once R and D are obtained, we need to estimate the inverse correlation matrix A = R 1 . 
In this subsection, we introduce the EPIC method for estimating A, which estimates the 
j th column of A by plugging the transformed Kendall’s tau estimator R into the convex 
program, 


(A = argmin 


'*3 


l 1 +2 r i’ 


s.t. 


| I*J 11 OO — ? 


11 1 — 7~j * 


(26) 


Here r,- serves as an auxiliary variable which ensures that we can use the same regularization 


parameter A for estimating different columns of A (Gautier and Tsybakov, 2011). Both the 


objective function and constraints in (26) contain r,-, which ensures that t :/ is bounded. Zhao 


and Liu (2014) show that the regularization parameter A in (26) does not depend on the 
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unknown quantity A. Thus we can use the same A to estimate different columns of A. 


The optimization problem in (26) can be equivalently formulated as a linear program. For 


notational simplicity, we omit the index j in (26). We denote AI 7 , and t 3 by 7 , e, and 


r respectively. Let 7 + and 7 be the positive and negative parts of 7. By reparametrizing 


7 = 7 + — 7 , we rewrite (26) as the following linear program 


(7 + ,7 , r) = argrnin 1'7 + + 1/7 + cr 


s.t. 


R -R -A 
-R R -A 

V V -1 

7 + > 0 , 7' > 0 , r > 0, 



7 + 


e 


7" 

< 

— e 


r 


0 


(27) 


where A = Al. The optimization problem in (27) can be solved by any linear program solver 


(e.g., the classical simplex method as suggested in Cai et al. (2011)). In particular, it can be 


efficiently solved using the parametric simplex method (Vanderbei, 2008), which naturally 


exploits the underlying sparsity structure, and attains better empirical performance than a 
general-purpose solver. 

4.1.4 Symmetric precision matrix estimation 

Once we obtain the inverse correlation matrix estimate A, we can estimate © by 

© = D -1 AD -1 . 


The EPIC method does not guarantee the symmetry of ©. To obtain a symmetric estimator, 
we take an additional projection step: 


© = argrnin ||© — ©||* s.t. © = ©', 
© 


(28) 


where || • ||* can be the matrix 0-, Frobenius, or elementwise max norm. For both the 


Frobenius and elementwise max norms, (28) has a closed form solution 


© = -(© + ©' 


When using the matrix fj-norm, the optimization problem in (28) does not have a closed- 


form solution. For this, we can exploit the smoothed proximal gradient algorithm to solve 


it. More details about this algorithm can be found in Zhao and Liu (2014). 
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Consider a class of sparse symmetric matrices 


U(s,M,k u ) = {A e R pxp Ay 0, maxj] 1 {A kj ± 0} < s, || A||i < M, A, 


(A) < KV 


where n u is a constant, and ( s,p,M ) may scale with the sample size T. Under some mild 
conditions, 


Zhao and Lin 


(2014) show that if we take A = Ki\J (logp)/T and choose the 


matrix t'j-norm as || • ||* in (28), then for large enough T, we have 


\0.0h=o P [M- s J^y 


(29) 


Moreover, if we choose the Forbenius norm as 


in (28), then for large enough T, 


-II©-0|| 2 = O p (M 2 


s logp^ 
T )' 


(30) 


4.2 Robust covariance matrix estimation 

In this subsection, we consider the problem of estimating the covariance matrix X under 


the elliptical model (19). Similar to Section 2, we impose sparsity assumption on X. To 


estimate X, Liu et al. (2014b) introduce a regularized rank-based estimation method named 


EC2 (Estimation of Covariance with Eigenvalue Constraints), which can be viewed as an ex¬ 


tension of the generalized thresholding operator (Rothman et ah, 2009). The EC2 estimator 


can be formulated as the solution to a convex program which ensures the positive definite¬ 
ness of the estimated covariance matrix. Unlike most existing methods, the EC2 estimator 
explicitly constrains the smallest eigenvalue of the estimated covariance matrix. 


4.2.1 The EC2 Estimator 

Recall that X = DRD. Similar to the EPIC method, we calculate the EC2 estimator 
in three steps: In the first step, we obtain robust estimators R and D for R and D. In the 
second step, we apply an optimization procedure on R to obtain R EC2 , a sparse estimator 
for R. In the third step, we assemble R EC2 and D to obtain the final sparse covariance 
matrix estimator X = DR Et 2 D. Specifically, we calculate R and D as in (21) and (24). In 
the following, we focus on explaining how to obtain R EC2 based on R. 

Recall that R is the transformed Kendall’s tau matrix, the R EC2 is calculated as 


R EC2 := argmin ^||R - R||| + A||R||i i0ff s.t. r < A min (R) 

diag(R)=l 2 


(31) 
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where A > 0 is a regularization parameter, and r > 0 is a desired minimum eigenvalue lower 
bound of the estimator which is assumed to be known. The EC2 method simultaneously 
conducts sparse estimation and guarantees the positive-definiteness of the solution. The 


»EC2 


equality constraint diag(R) = 1 ensures that R EC2 is a correlation matrix. Once R 
obtained, we convert it to the final covariance matrix estimator X as described above. 


is 


Liu 


et al. (2014b) prove the convexity of the formulation in (31). Alternatively, one can apply 


thresholding on R to obtain a positive definite estimator. 


4.2.2 Asymptotic properties of the EC2 estimator 

To establish the asymptotic properties of the EC2 estimator, for 0 < q < 1, we consider 
the following class of sparse correlation matrices: 

M. (q, M p , 8) < R : max ^^|Rj fc |' i < M p and Rjj = 1 for all j, A min (R) > 5 >. 

^ i - 2 - p k^j J 


We also define a class of covariance matrices: 


U (k, q, M p , 5) := | X : max Xjj < n and D x XD 1 e M (q, M p , 5) |, 


(32) 


where D = diag(v / X)7,..., -y/X^)). The dehnition of this class is similar to the “universal 
thresholding class” defined by Bickcl and Levina; (2008). 


Under the assumption that the data follow an elliptical distribution, Liu et al. (2014b) 
show that, for large enough T, the EC2 estimator X satisfies 


1-9 


sup E||X 

£e«(K,<2,Mp,<5 min ) 


EC2 


< ci • M r 


flogp\ 2 


| 2 — -1 *’ Py rp j 


(33) 


Cai and Zhou (2012) show that the rate in (33) attains the minimax lower bound over the 


class U(k , q, M d , 5 min ) under the Gaussian model. Thus the EC2 estimator is asymptotically 
rate optimal under the flexible elliptical model with covariance matrix in U(n, q , M d , (fmin)- 


4.3 Extension to the elliptical copula family 

In Sections 4.1 and 4.2, we introduced the regularized rank-based covariance and precision 
matrix estimation methods by assuming the underlying distribution of Y = (Yi,..., Y p )' is 
elliptical. In fact, these rank-based procedures also work within the more general transellip¬ 


tical family (Liu et al. 2012c), which is exactly the elliptical copula family but with different 


identihability conditions. More specifically, we say Y = (Y,..., Y p )' follows a transellipti- 
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(a) The Vein diagram (b) The perspective plot of a transelliptical density 

Figure 2: Transelliptical family, (a) The Vein diagram illustrating the relationships of the 
distribution families (The Nonparanomral family is equivalent to the Gaussian copula fam¬ 
ily). (b) The perspective plot of a transelliptical density. 


cal distribution, denoted by Y ~ TE(n, E, £; /), if there exists a set of strictly increasing 
functions {fj} P J= \ such that f(Y ) = (fi(Yi),...,f p (Y p ))' follows the elliptical distribution 


EC(^l,C,, E). To ensure the model is identifiable, Liu et al. (2012c) impose the identihability 
condition that, for j G {1,... ,p}, 


E fjiYj) = E Yj and Var(/■(*})) = Var(V j ) 


(34) 


As the Kendall’s tau statistics in (21) are invariant under the monotonic transform, the 


Kendall’s tau statistics for the elliptical data f(Y) are the same as those for the transel- 
lipitical data Y. Therefore, we do not need to estimate the monotonic transformations / 
for computing the Kendall’s tau. On the other hand, these monotonic transforms are not 
hard to estimate. For example, for the Gaussian copula such that the marginal distribution 
m) r^j N( 0,1), then based on the empirical distribution of the observed data Yj and the 
known marginal distribution 1V(0,1), we can easily estimate fj. 

Figure [2](a) illustrates the relationships between the transelliptical, elliptical, and non¬ 


paranormal families (Liu et ah, 2009, 2012b). The nonparanormal family is a a proper subset 


of the transclliptical family. We define Y = {Y\ ,..., Y p )' to be a nonparanormal distribution, 
denoted by Y ~ NPN(n, E, /), if there exists a set of strictly increasing functions {fj} p =1 
such that f(Y) = (fi(Yi), ..., f p (Y p )y follows the Gaussian distribution A r (/z, E). Liu et al. 


(2012c) show that the intersection between the nonparanomral family and elliptical family is 


the Gaussian family. Figure [2](b) visualizes the perspective plot of a bivariate transelliptical 
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Figure 3: The hierarchical latent variable representation of the transelliptical graphical model 
with the latent variables grey-colored. Here the first layer is composed of observed Y)’ s, 
and the second and third layers are composed of latent variables Zf s and Xj’s. The solid 
undirected lines in the third layer encode the conditional independence graph of Xi,... ,X p 
(Adapted from a manuscript that is under review). 


density with certain marginal transformations. The transelliptical family is much richer than 
the elliptical family and its density function does not have to be symmetric. 

The rank-based EPIC and EC2 methods can be directly applied to the transelliptical 


family. To understand the semantics of a transelliptical graphical model, Liu et al. (2012c) 
proved that a transelliptical distribution admits a three-layer hierarchical latent variable 
representation as illustrated in Figure |3j The observed vector, denoted by Y — (Yi,..., Y p )' 
as presented in the first layer, has a transelliptical distribution, and a latent random vector, 
Z = (Z 1} ..., Z p y in the second-layer, is elliptically distributed. Variables in the first and 
second layers are related through the transformation Z 3 = fj{Yj) with fj being an unknown 
strictly increasing function. The latent vector Z can be further represented by a third-layer 
latent random vector X = (Ad,..., X p ) r , which is a multivariate Gaussian with a covariance 
matrix X (called latent covariance matrix) and an inverse covaraince matrix © = S " 1 (called 
latent precision matrix). 

We define the transelliptical graph G = (V, E) with the node set V = {l,...,p} and 
the edge set E encoding the nonzero entries of ©. The interpretations of the graph G are 
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different for the variables in different layers: (i) For the observed variables in the first layer, 
the absence of an edge between two variables means the absence of a certain rank-based 
association (e.g., Kendall’s tau) of the pair given other variables; (ii) For the latent variables 
in the second layer, the absence of an edge means the absence of the conditional Pearson’s 
correlation of the pair; (iii) For the third layer variables, the absence of an edge means the 
conditional independence of the pair. Compared with the Gaussian and elliptical graphical 
model, the transelliptical graphical model has richer structure with more relaxed modeling 
assumptions. The three layers of hierarchy also reflects an interesting tradeoff between model 
flexibility and interpretability. In the third layer, the model is the most restrictive Gaussian 
family, but we can get strong conditional independence arguments. In contrast, in the first 
layer, the model is the much more flexible transelliptical family, but we can only get weaker 
conditional uncorrelatedenss (with respect to the rank correlation) statements. 

Since the Kendall’s tau statistic is monotone transformation invariant, it is easy to see 
that the theory and methods of the EPIC and EC2 procedures introduced in Section 4.1 and 
Section 4.2 are also applicable to the transelliptical distributions, though the interpretations 
of the fitted results are different (as explained in this section). 


5 Factor model-based covariance estimation with ob¬ 
servable factors 


Most of the aforementioned methods of estimating S assumes that the covariance matrix 
is sparse. Though this assumption is reasonable for many applications, it is not always 
appropriate. For example, financial stocks share the same market risks and hence their 
returns are highly correlated; all the genes from the same pathway may be co-regulated by a 
small amount of regulatory factors, which makes the gene expression data highly correlated; 
when genes are stimulated by cytokines, their expressions are also highly correlated. The 
sparsity assumption is obviously unrealistic in these situations. 

In many applications, the responses of cross-sectional units often depend on a few common 
factors f : 

Yu = b [ft + u it . (35) 


Here b, is a vector of factor loadings; ft is a K x 1 vector of common factors, and u lt is 
the error term, usually called idiosyncratic component , uncorrelated with f t . Factor models 
have long been employed in financial studies, where Y lt often represents the excess returns 


of the it\i asset (or stock) on time t. The literature includes, for instance, Faina and French 


(1992); Chamberlain and Rothschild (1983); Campbell et al. (1997). It is also commonly 


25 










used in macroeconomics for forecasting diffusion index (e.g., Stock and Watson (2002)). We 
allow p, T —y oo and that p can grow much faster than T does. In contrast, the number of 
factors K needs to be either bounded or grows slowly. 

This section introduces a method of estimating £ using factor models. We will focus on 
the case when the factors are observable. The observable factor models are of considerable 
interest as they are often the case in empirical analyses in finance. 


5.1 Conditional sparsity 


The factor model (35) can be put in a matrix form as 


Y t = B ft + u t . 


(36) 


where B = (fcp, ...,b p )' and u t = (u u , ■ ■■,u pt )'. We are interested in £, the p x p covariance 
matrix of Y t , and its inverse © = £ _1 , which are assumed to be time-invariant. Under 


model (36) and the independence assumption between f t and u t , £ is given by 


£ = BCov(/ t )B' + £ u , 


(37) 


where £ u = ( cr u ,ij) P x P is the covariance matrix of u t . Estimating the covariance matrix £ n 
of the idiosyncratic components {u t } is also important for statistical inferences. 

Fan et al. (2008) studied model (l37j) when p —» oo possibly faster than T. They assumed 


£ u to be a diagonal matrix, which corresponds to the classical “strict factor model”, and 
might be restrictive in practical applications. On the other hand, factor models are often only 
justified as being “approximate”, in which the Y lt , ...,Y pt are still mutually correlated given 
the factors, though the mutual correlations are weak. This gives rise to the approximate 


factor model studied by Chamberlain and Rothschild (1983). In the approximate factor 


model, £ u is a non-diagonal covariance matrix, and admits many small off-diagonal entries. 


In the decomposition (37), we assume £ M to be sparse. This can be interpreted as the 


conditional sparse covariance model: Given the common factors /i,.... //-, the conditional 
(after taking out the linear projection on to the space spanned by the factors) covariance 
matrix of Y t is sparse. Let 




J rnaXjXp Y7 j= i 1 Wu,ij ~f~ 0} ? if g = 0 


(38) 


|maxj<p Yfj =i \ a u,ij\ 9 , if 0 < q < 1 

We require m U}P be either bounded or grow slowly as p —> oo. The conditional sparsity 
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assumption is slightly stronger than those of the approximate factor model in Chamberlain 


and Rothschild (1983), but is still a natural assumption: the idiosyncratic components are 


mostly uncorrelated. In contrast, note that in the presence of common factors, X itself is 
hardly a sparse matrix. 


5.2 Estimation 

When the factors are observable, one can estimate B by the ordinary least squares (OLS): 
B = (b 1; ..., bjv)', where, 


b, 


1 

arg min — 
b i T 


T 


- b 'M 2 ’ 


i = 1,..., N. 


Then, u t = Y t - Bf t is the residual vector at time t. We then construct the residual 
covariance matrix as: 

1 T 

^ Ut u t — ( s u,ij)- 
t= 1 

Since is sparse, we now apply thresholding on S u to regularize the estimator. Define 


Y u — 


(@u,ij)pxpi &u,ij 


h ( Su,ij i T,ij ) 5 


i = ji 

*7 1 j- 


Here h(.]uiT,ij ) is a general thresholding rule as described in Section 2. Both the adaptive 
thresholding and entry dependent thresholding can also be incorporated, by respectively 
setting u T ,ij = SE (s Utij )oj T and u T ,ij = u T , with 


_ /log P 
uj t — CK y 

for some C > 0. As in the discussions in Section 2, C > 0 can be chosen via cross-validation 
in a proper range to guarantee the finite sample positive definiteness. 

The covariance matrix Cov(ft) can be estimated by the sample covariance matrix 

T T 

C OT(/ t ) = /)(/, - /)', /=!£>, 

t =1 t= 1 

which does not require regularization since the number of factor is assumed to be small. 
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Therefore we obtain a substitution estimator: 


S = BCov(/t)B' + £ u . 


By the Sherman-Morrison-Woodbury formula, we estimate the precision matrix as 


E' 1 = E; 1 - E-'BICovf/,)- 1 + B , E“ 1 B] _1 B , E“ 1 . 


Under regularity conditions, Fan et al. (2011) showed that when 9 —» 0, 


- SJ 2 = 0 P (m u , p </ 9 ), US/ - S/|| 2 = 0 P (m U}P u 4 9 ), 




On the other hand, it is difficult to obtain a satisfactory convergence rate for S under either 
the operator or the Frobenius norm. We illustrate this problem in the following example. 
Let 0 d be a d-dimensional row vector of zeros. 


Example 1. Consider the specific case K = 1 with the known loading B = l p and = I. 
Then S = Var(/i)l p lJ + I, where l p denotes the p-dimensional column vector of ones with 
||l p lp|| 2 = p, and we only need to estimate Var (ffi) using the sample variance. Then, it 
follows that 

1 T 

II s - s ll2 = IT(/ii - /if - Var(/ It )| • ||m;|| 2 , 

t= 1 

Therefore, it follows from the central limit theorem that ^||S—S|| 2 is asymptotically normal. 
Hence ||S — S|| 2 diverges if p ^ VT, even for such a simplified toy model. 


In the above toy example, the bad rate of convergence is mainly due to the large quan¬ 
tity ||l p lp|| 2 , which comes from the high-dimensional factor loadings. In general, the high¬ 
dimensional loading matrix accumulates many estimation errors. 


On the other hand, Fan et al. (2011) showed that we can obtain a good convergence rate 
when estimating E” 1 : 


i-i 


i—i i 


2 = 0 P (m UtP u l T q ). 


Intuitively, the good performance of S 1 follows from the fact that the eigenvalues of S 1 
are uniformly bounded, whereas the leading eigenvalues of S may diverge fast. 








6 Factor models-based covariance estimation with la¬ 
tent factors 

In many empirical studies using factor models, the common factors are often latent, 
that is, they are unobservable. In this case, the covariance matrix of Y t has the same 
decomposition as before: 

S = B Cov(/ t )B / + £„, (39) 

but the latent factors also need to be estimated. Similar to the case of observable factors, 
the model can be assumed to be conditionally sparse, where £ u is a sparse matrix but not 
necessarily diagonal. In this section we shall assume the number of factors to be bounded. 


6.1 The pervasive condition 


Note that unlike the classical factor analysis (e.g., Lawley and Maxwell (1971)), when 
Ti u is non-diagonal, the decomposition (39) is not identifiable under fixed ( p,T ), since Y t 
is the only observed data in the model. Here the identification means the separation of 


the low-rank part B Cov(/ t )B / from £ u in the decomposition (39). Interestingly, however, 
the identification of B Cov(/ t )B / can be achieved asymptotically, by letting p —s- oo and 
requiring the eigenvalues of £ u to be either uniformly bounded or grow slowly relative to p. 
What makes the “asymptotic identification” possible is the following pervasive assump¬ 


tion, which is one of the key conditions assumed in the literature (e.g., Stock and Watson 
p002|);[B^I|p003|): 


Assumption 1 . The eigenvalues of the K x K matrix p 1 B / B = ^ Xu =1 bb' are uniformly 
bounded away from both zero and infinity, as p —>■ oo. 


When this assumption is satisfied, the factors are said to be “pervasive”. It requires the 
factors impact on most of the cross-sectional individuals. It then follows that the first K 
eigenvalues of B Cov(/ f )B ; are bounded from below by cA min (Cov(/i))p for some c > 0, and 
should grow fast with p. On the other hand, 


p 

£, u || 2 < max V'' \o’ Uj ij\ q \a Uj aa Uj jj\^ 1 ~ q ^ 2 < m UtP max 

i<p < i<p ’ 

3 = 1 


(40) 


Hence when m UtP grows slower than 0(p), the leading eigenvalues of the two components 
on the right hand side of (39) are well separated as p —» oo. This guarantees that the 
covariance decomposition is asymptotically identified. Intuitively, as the dimension increases, 
the information about the common factors accumulates, while the information about the 
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idiosyncratic components does not. This eventually distinguishes the factor components 
B ft from u t . 

Below we shall introduce a principal component analysis (PCA) based method to estimate 
the covariance matrix. 


6.2 Principal Component and Factor Analysis 

Before introducing the estimator of X in the case of latent factors, we first elucidate why 
PCA can be used for the factor analysis when the number of variables is large. First of all, 
note that even if B Cov(/ t )B / is asymptotically identifiable, B and f t are not separately 
identifiable, since the pair (B , f t ) is equivalent to the pair (BH -1 .H/ t ) for any K x K 
nonsingular matrix H. To resolve the ambiguity between B and f t , we impose the identib- 
ability constraint that Cov(f t ) = 1 K and that the columns of B are orthogonal. Under this 


canonical form, it then follows from (39) that 


X = BB' + X,,. 


Let bi, ..., bj<- be the columns of B. Since the columns of B are orthogonal, 

BB'bj = bj-11bj-1||, for j < K. 

Therefore, bi/||bi|| 2 , • • • , t>/c/||b^ || 2 are the eigenvectors of BB', corresponding to the largest 
K eigenvalues {||b^ ||llyLii the rest p — K eigenvalues of BB' are zeros. To guarantee the 
uniqueness (up to a sign change) of the leading eigenvectors, we also assume {|| b^ || 2 }^)=! are 
distinct and sorted in a decreasing order. To see how large these eigenvalues are, note that 
the first K eigenvalues of BB' are the same as those of B'B Hence it follows from the 
pervasive assumption (Assumption 1) that 

||by||l > cp, j — 1,..., K. (41) 


Next, let us associate the leading eigenvalues of BB' with those of X. Let Ai,...,A^ 
denote the K largest eigenvalues of X, and let C, be the corresponding eigenvectors. 


Applying Wely’s theorem and the sin(6 l )-theorem of Davis (1963), Fan et al. (2013) showed 


U II 2 II 2 


= Oip 


-ll 


J u||2h 


for all j < K. 


and 

|A j - ||bj|| 2 1 < ||S U || 2 , for j < K , |Aj| < ||X U || 2 , for j > K. 
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These results demonstrate: 


1. The leading eigenvectors of X are approximately equal to the normalized columns of 
B, as p —> oo. In other words, the factor analysis and the principal analysis are 
approximately the same. 


2. The leading eigenvalues of X grow at rate 0(p). This can be seen from applying the 
triangular inequality and (40), (41): 


A j > 11bj 111 - |A j - 


N 


> cp - rn u , p max ct'T, 

t<p 5 


Vj = K. 


3. The latent factor f jt is approximately yf\j for j = 1 To see this, left- 

multiplying b'-/ 11 b ? 11 2 to Y t = B ft + u t , and noting that the columns of B are orthog¬ 
onal, we have 

f ]t = ^Y t /\\b J \\l-^u t J\\b J \g 

The second term on the right is the weighted average of noise u t over all p individuals 
and hence typically negligible when p is large. The first term is 

b'/IN| 2 ^ ^ j' 3 Y t 

INI! II b ill 2 

Hence as p oo, f jt « £' Y t /y/\]. 

Therefore, we conclude that the first K eigenvalues of X are very spiked, whereas the 
remaining eigenvalues are either bounded or grow slowly. In addition, both the latent factors 
and loadings can be approximated using the eigenvalues and eigenvectors of X and Y t . This 
builds the connection between the PCA and high-dimensional factor models. 


6.3 POET estimator 


Fan et al. (2013) proposed a nonparametric estimator of X when the factors are unobserv¬ 
able, named POET (Principal Orthogonal complEment Thresholding). To motivate their 
estimator, note that BB' = b ? b). From the discussions of the previous subsection, 

heuristically we have 

K K 

3 = 1 i =1 
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In fact, it can be formally proved that 

K 

l|BB'-5>«'|U = 0(p- 1/2 ), 

3 = 1 

which can be understood as the (asymptotic) identihcation for BB'. In addition, note 
that £ has the spectral decomposition £ = Xp=i an d the f ac t° r decomposition 

£ = BB' + £ u . Therefore, 

£ Wii'i- 

j=K +1 

Under the conditional sparsity assumption, X^j=if+i i s approximately a sparse matrix. 

One can then estimate £ u by thresholding the sample analogue of Y^=k +i 

Specifically, the POET estimator is dehned as follows. Let Ai > A 2 > • • • > A p be the 
ordered eigenvalues of the sample covariance matrix S, and £i,...,£ p be the corresponding 
eigenvectors. Then the sample covariance has the following spectral decomposition: 

K 

S = £ U.& + s„, 

2=1 

where S u = Y^k=K+\ ^k€k£k = ( s u,ij ), called “the principal orthogonal complement”. We 
apply the generalized thresholding rule on S u . Define 


(@u,ij) py.pi &u,ij 


Su,iii ^ Ji 

h{,Su,ij'i ^T,ij) 1 i 7 ^ j- 


For instance, the entry dependent thresholding sets c Ur,ij = y/s Ui uS U jjU)T- Importantly, ut is 
different from before when the factors are latent, and should be set to 



It was then shown by Fan et al. (2013) that 


max | s uij - a u>ij \ = O p (uj t ). 

i,j<p 

The extra term -L in u T is the price paid for not knowing the latent factors, and is negligible 
when p grows faster than T. Intuitively, when the dimension is sufficiently large, the latent 
factors can be estimated accurately enough as if they were observable. 
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The POET estimator of E is then defined as: 


K 


^ K£,i£,i + 


(42) 


i— 1 


This estimator is optimization-free and is very easy to compute. 

Note that E k requires the knowledge of K , which is the number of factors and practically 
unknown. There has been a large literature on determining the number of factors and many 


consistent estimators have been proposed, such as Bai and Ng (2002); Alessi et al. (2010); 


Hallin and Liska (2007[), and Ahn and Horenstein (2013). In addition, numerical studies 


m 


Fan et al. (2013) showed that the covariance estimator is robust to over-estimating K. 


Therefore, in practice, we can also choose a relatively large number for K even if it is not a 
consistent estimator of the true number of factors. In the sequel, we suppress the subscript 
K , and simply write S as the POET estimator. 


6.4 Asymptotic Results 


Under the conditional sparsity assumption and some regularity conditions, Fan et al. 
(2013) showed that when ujlf q rn UiP —> 0, we have 


SJ 2 - 0 P (oj? q m UjP ) , ||- E ,/||2 = 0 P (u l T q m u , p ) . 


On the other hand, the problem of bad rate of convergence for E is still present, because 
the first K eigenvalues of E grow with p. We can further illustrate this point in the following 


example (taken from Fan et al. (2013)): 


Example 2. Consider an ideal case where we know the spectrum except for the first eigen¬ 
vector of E, and assume that the largest eigenvalue Ai > cp for some c > 0. Let be the 
estimated first eigenvector and define the covariance estimator E = Ai^i^ + 

Assume that £1 is a good estimator in the sense that ||£i — £i|| 2 = O p {T~ l ). However, 


E-E|| 2 


IIAilfh; - €i€i)|| 2 = AiO p (||£ - £|| 2 ) = O p (A.T- 1 / 2 ), 


which can diverge when T = 0{jp 2 ). 

Similar to the case of observable factors, we can estimate the precision matrix with a 
satisfactory rate under the operator norm. The intuition still follows from the fact that E^ 1 

showed that E _1 has the same rate of 

convergence as that of E” 1 . 


has bounded eigenvalues. Indeed, 


Fan et al. 


(2013 


33 






























7 Structured factor models 


7.1 Motivations 


In the usual asymptotic analysis for factor models, accurate estimations of the space 
spanned by the eigenvectors of S require a relatively large T. In particular, the individual 
loadings can be estimated no faster than Op{ T” 1 / 2 ). But data sets of large sample size 
are not always available. Often we face the “high-dimensional-low-sample-size” (HDLSS) 
scenario, as described in Jung and Marron (2009). This is particularly the case in financial 
studies of asset returns, as their dynamics can vary substantially over a longer time horizon. 
Therefore, to capture the current market condition, financial analysts wish to use short time 
horizon to infer as good as possible the risk factors as well as their associated loading matrix. 
To achieve this, we need additional data covariate information and modeling of the factor 
loadings. 

Suppose that there is a d-dimensional vector of observed covariates associated with the 
i th variable: Xi = (Xu,-- - ,Xid), which is independent of ua- For instance, in financial 
applications, Xi can be a vector of firm-specific characteristics (market capitalization, price¬ 
earning ratio, etc); in health studies, Xi can be individual characteristics (e.g. age, weight, 
clinical and genetic information). To incorporate the information carried by the observed 


characteristics, Connor and Oliver (2007) and Connor et al. (2012) model explicitly the 


loading matrix as a function of covariates X. This reduces significantly the number of 
paramters in B. Specifically, they proposed and studied the following semi-parametric factor 
model: 


K 

Yu ^ ^ 9k(Xi) fkt T Uitj i 1, 

k =1 


,P,t = !,■■■ ,T. 


(43) 


Here gk(Xi ) is an unknown function of the characteristics and they assume further the 
additive modeling 

9k(Xi) = gki(Xn) + • • • + gkd(Xid). (44) 


Fan et al. (2014b) recognized that the above semi-parametric model (43) might be re¬ 


strictive for applications, as we do not expect that the covariates capture completely the 
factor loadings. They extend the model to the following more flexible semiparametric mixed 
effect model: 


K 

Yit = YM*i) + 7 ik]fkt + ug, i = l = ,T. (45) 

k= 1 

Here 7 ^ is an unobservable random component with mean zero. They developed econometric 
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techniques to test the model specifications (43) and (45). Their empirical results, using the 
returns of the components of the S&P500 index and 4 exogenous variables (size, value, 
momentum, and volatility) as in Connor et al. (2012), provide stark evidence that model 
(43) can not be validated empirically whereas (45) is consistent with the empirical data. 


7.2 Projected PCA 

The basic idea of projected PCA is to smooth the observations {Y lt }f =1 for each given day 
t against its associated covariates {X, ; ]T =1 . More specifically, let {Y it y i=l be the fitted value 
after run a regression of {Y it }? =1 against {X,y =l for each given t. The regression model can 


be the usual linear regression or additive regression model (44). This results in a smooth or 


projected observation matrix Y, which will also be denoted by PY. The projected PCA is 
then to run PCA based on the projected data Y. 


To provide the rationale behind this idea, we now generalize model (45) further to illus¬ 


trate the idea behind the projected PCA. Specifically, consider the factor model 


Y = BF' + U 


where Y and U are p x T matrices of y lt and uu. Suppose that there is a d-dimensional 
vector of observed covariates associated with the i th variable: X, = (Xji,--- ,X id ), which 
is independent of uu. For a pre-determined J, let </>i, be a set of basis functions. Let 

and <f>(X) = (^X,),..., <f>(X p ))' be a px ( Jd) 
matrix of the sieve-transformed X. Then the projection matrix on the space spanned by 
X = (X 1 ,...,X p ) can be taken as 

P = $(X)(<h(X) , $(X))" 1 $(X) / . 


This corresponds to modeling gi i (X i ) in (45) by the additive model (44) and approximating 
each term using the series expansion. The projected data PY is the fitted value of the 


additive model (44) with basis functions 0i, ...,0j: 


K j 


Yu ^ ^ 1 @jk,t4*j (-'A*/;)] T £iti i I; iPi^ 1) i T. 


k=1 j=1 


The design matrix does not vary with t, neither does the projection matrix P. 

We make the following key assumptions: 

Assumption 2. (i) Pervasiveness: With probability approaching one, all the eigenvalues 
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of p(PB) / PB are bounded away from both zero and infinity as p —* oo. 

(ii) Orthogonality: E(ui t \Xn,...,X i( i ) = 0, for all i < p,t < T. 

The above conditions require that the strengths of the loading matrix should be as strong 
after the projection, and B should be associated with X. Condition (ii) implies that if we 
apply P to both sides of Y = BF' + U, then 

PY « PBF'. 


where PU « 0 due to the orthogonality condition. Hence the projection removes the noise 
in the factor model. In addition, for the purpose of normalizations, we assume Cov(ft) = Ik, 
and that (PB)'PB is a diagonal matrix. 

We now describe the rationale of the projected PCA. For simplicity, we ignore the effect 
of PU. Let us consider the p x p covariance matrix of the projected data PY. The previous 
discussions show that ^PY(PY)' « PB(PB)'. Since (PB)'PB is a diagonal matrix, the 
columns of PB are the eigenvectors of the pxp matrix 4PY (PY)', up to a factor ^Jp. Next, 
consider the TxT matrix ^(PY)'PY ph ^F(PB)'(PB)F 7 . It implies 

^(PY)'PYF « F(PB)'(PB). 


Still by the diagonality of (PB)'PB, we infer that the columns of F are approximately the 
eigenvectors of the TxT sample covariance matrix ^(PY)'PY, up to a factor y/T. In addi¬ 
tion, since the diagonal elements of (PB)'PB grow fast as the dimensionality diverges, the 
corresponding eigenvalues are asymptotically the first K leading eigenvalues of ^(PY)'PY. 
This motivates the so-called “projected PCA” (Fan et al. (2014b)), a new framework of esti¬ 
mating the parameters for factor analysis in the presence of a known space X. The projected 
PCA can be more accurate than the usual PCA in the HDLSS scenario. It applies really the 
PCA to the projected data (smoothed data) PY. 

Let V be a T x K matrix, whose columns are the eigenvectors of the TxT matrix 
^Y'PY corresponding to the larges K eigenvalues. Following the previous discussions, we 
respectively estimate the projected loading matrix PB and latent factors F by 


G(X) = ^PYF. F = VfX. 


A nice feature of the projected-PCA is that the consistency is achieved even when the 


sample size T is finite, as shown in Fan et al. (2014b). Thus, it is particularly appealing 
in the HDLSS context. Intuitively, there are two sources of the approximation errors: (i) 
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P approximates P and (ii) the normalized B approximates the leading eigenvectors of S. 
Neither of the approximation errors require a large sample size T in order to be asymptotically 


negligible. This implies the consistency under a finite T. See Fan et ah (2014b) for more 
detailed discussions on this aspect. 


7.3 Semi-parametric factor model 


In the model ( |45| ) , let G(X) and T respectively denote the p x K matrices of gk(Xj) and 
7 ij. Then the matrix form of the model can be written as 

Y = [G(X) + riF' + U. 


So the model assumes that the loading matrix can be decomposed into two parts: a part 
that can be explained by X and the part cannot. To deal with the curse of dimensionality, 
we assume gk {-) to be additive: gk(Xi ) = 9 ki(Xu), with d = dim(Xj). 


Applying the projected-PCA onto the semi-parametric factor model, Fan et al. (2014b) 
showed that as p, J —» oo, T may either grow or stay constant, 


^H F -F|| a = 0,(-) 


^=I|G(X) 

Vp 


G(X)|| 2 = 0, 


(pmin{T',p}) 1 / 2 _ 1 A 2ft ) 


where k is the degree of smoothness constant for £/&(•). Clearly under the high dimensionality, 


the rate of convergence is fast even if T is finite. We refer the readers to Fan et ah (2014b) for 


more detailed discussions on the impacts of improved rates of convergence in factor models. 


8 Discussions 

This paper introduces several recent developments on estimating large covariance and 
precision matrices. We focus on two general approaches: rank-based method and factor 
model based method. We also extend the usual factor model to a projected PCA setup, and 
show that the newly introduced projected PCA is appealing in the high-dimensional-low- 
sample-size scenario. Such an approach has drawn growing attentions in the recent literature 
on high-dimensional PCA (e.g., Jung and Marron (2009); Shen et al. (2013a|b); Ahn et al. 


(2007)). In addition, we introduce the rank-based approaches, including the EPIC and 
EC2 estimators, for estimating large precision and covariance matrices under the elliptical 
distribution family. These rank-based methods are robust to heavy-tailed data and achieve 
the nearly optimal rates of convergence in terms of spectral norm errors. 
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A promising future direction is to combine the factor based analysis and rank-based 
analysis into an integrated framework. For instance, consider the factor model 

= B ft + u t 

with observed factors {ft}- Here the idiosyncratic components uf s are heavy-tailed but 
follow the elliptical distribution. Define the population Kendall’s tau correlation between 
Ujt and Ukt as 


T~u,kj IP {.{.Ujt. Ujt) ( u kt; U k ) 0) IP (iV'jt Uj ) (.Ukt U k ) 0), 


where u 3 and u k are independent copies of Uj t and u k t respectively. Let R„ be the correlation 
matrix of u t , and D (1 be the diagonal matrix of the individual standard deviations of {uj t }. 
Then H u = D (1 R U D U . For elliptical distributions, we have 


Ru [R?i./cj] 


7T 

Sill | ~T Utkj 


(46) 


Under the conditional sparsity condition, R,, is a sparse matrix. 

Given the “estimated residuals” {uu}, the sample version Kendall’s tau statistic is 


T u,kj — 


T(T — 1) “ 


sign (( u kt ~ Ukt'){u jt ~ ')j 


for all k 7 ^ j, and f U: kj = 1 otherwise. We can plug r U)k j into (46) and obtain a rank-based 
error correlation estimator R„ = [R.,,^] = [sin (| T Uj kj)\ ■ We then apply thresholding on R„ 
to produce a sparse matrix estimator: 


iT _ mT \ 6T * J "’ 1 J] 


T> / _ /T3 / \ 13 / _ 

Jpxp') ^Uflj 


h'(^ J U,kj j wt), if]- 


Here h{.\ ut) is a general thresholding rule as described in Section 2, with a properly chosen 
threshold value c ot- The entry-dependent threshold can also be used. Alternatively, we can 
apply the nearest positive definite projection to produce a sparse covariance estimator based 
on R u . 

Given the estimated residuals, standard deviations in D u can be estimated similarly as 
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before. Specifically, let rhj be the estimators of E u 2 - t by solving: 



where K max is an upper bound of max^ Var(u^). Then the rank-based estimator of D„ is a 
diagonal matrix D, t , whose diagonal elements are a u j = \J max {rhj, j. where K mm is a 

lower bound of rnirij E u? t and is assumed to be known. This leads to the rank-based error 
covariance estimator: 

S u — D,iR3"D u . 

When the factors are observable, the residuals should be obtained by estimating B. The 
robust regression estimator B can be employed, e.g., L\ regression. With the estimated B, 
we set u t = Y t - B ft. The final factor-based covariance estimator is then given by: 

S = BCov(/ t )B' + £ u . 

The resulting estimator is expected to naturally handle heavy-tailed data. 

When the common factors are latent, they need to be estimated using robust PCA (that 
is, applying PCA on the rank covariance matrix of Y t ). The theoretical properties of such 
hybrid estimators are left for future investigations. 
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